- Goals of model selection and regularization
- Subset selection
- Ridge regression
- Lasso regression
Note: although we talk about regression here, everything applies to logistic regression as well (and hence classification).
With a set of candidate variables in hand, the goal is to select the best model. Why not simply include all the variables?
Big models tend to over-fit: they pick up features that are specific to the data at hand rather than generalizable relationships.
In addition, bigger models have more parameters, and hence more uncertainty about everything we are trying to learn.
We need a strategy for building a model that accounts for the trade-off between bias and variance: subset selection, shrinkage, or dimension reduction.
The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
Recall that the least squares fitting procedure estimates \(\beta_0, \dots, \beta_p\) using the values that minimize
\[RSS = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2\]
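Best subset selection, for example, fits a separate least squares model for each candidate subset of the \(p\) predictors and compares the resulting fits. A minimal sketch in R using the `leaps` package (the data frame `dat` and response `y` are hypothetical):

```r
library(leaps)

# Best subset selection: for each model size k, find the subset of k
# predictors minimizing the RSS (hypothetical data frame `dat` with
# response `y` and the predictors in the remaining columns).
fit <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1)
fit_summary <- summary(fit)

# Compare the best model of each size using, e.g., BIC or adjusted R^2.
which.min(fit_summary$bic)
which.max(fit_summary$adjr2)
```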
Ridge Regression is a modification of the least squares criterion that minimizes
\[\underbrace{\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2}_\text{traditional objective function of LS} + \underbrace{\lambda \sum_{j=1}^p \beta_j^2}_\text{shrinkage penalty} = RSS + \lambda \sum_{j=1}^p \beta_j^2\]
where \(\lambda > 0\) is a tuning parameter, to be determined separately.
Figure: Simulated data with \(n = 50\) and \(p = 45\) predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions, as a function of \(\lambda\). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
Note that the ridge penalty is not scale invariant, so it is best to standardize the predictors before fitting, e.g. with `scale()` in R.
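For example, a ridge fit with `glmnet` might look like the sketch below (the predictor matrix `X` and response `y` are assumed; `glmnet` also standardizes the predictors by default via `standardize = TRUE`):

```r
library(glmnet)

# Standardize predictors explicitly (glmnet would also do this internally
# by default, since the ridge penalty depends on the scale of the predictors).
X_std <- scale(X)

# alpha = 0 corresponds to ridge regression; glmnet fits the model over
# an automatically chosen grid of lambda values.
ridge_fit <- glmnet(X_std, y, family = "gaussian", alpha = 0)

# Coefficients at a particular value of lambda (lambda = 1 is an
# arbitrary illustrative value).
coef(ridge_fit, s = 1)
```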
Ridge regression does have an obvious disadvantage: it does not perform variable selection and it includes all \(p\) predictors in the final model.
Lasso Regression is a modification of the least squares criterion that minimizes
\[\underbrace{\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2}_\text{traditional objective function of LS} + \underbrace{\lambda \sum_{j=1}^p |\beta_j|}_\text{shrinkage penalty} = RSS + \lambda \sum_{j=1}^p |\beta_j|\]
The Lasso uses an \(\ell_1\) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\boldsymbol{\beta}\) is given by \(\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^p |\beta_j|\).
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? One can show that the lasso and ridge regression coefficient estimates solve the constrained optimization problems
\[ \begin{aligned} &\text{Ridge: } \text{argmin}_{\boldsymbol{\beta}} \ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 \quad \text{subject to } \sum_{j=1}^p \beta_j^2 \leq s \\ &\text{Lasso: } \text{argmin}_{\boldsymbol{\beta}} \ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 \quad \text{subject to } \sum_{j=1}^p |\beta_j| \leq s \end{aligned} \]
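A simple special case makes the difference concrete. Suppose, purely for illustration, that \(n = p\), the design matrix is the identity, and there is no intercept, so that least squares simply gives \(\hat\beta_j = y_j\). Then the ridge and lasso solutions have closed forms:

\[
\hat\beta_j^{\text{ridge}} = \frac{y_j}{1 + \lambda},
\qquad
\hat\beta_j^{\text{lasso}} =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\
0 & \text{if } |y_j| \leq \lambda/2
\end{cases}
\]

Ridge shrinks every coefficient by the same proportion but never sets one exactly to zero, whereas the lasso shrinks each coefficient toward zero by a fixed amount and zeroes out any coefficient smaller than \(\lambda/2\) in absolute value (soft thresholding).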
Which performs better, ridge or the lasso? It depends! Neither dominates the other: the lasso tends to do better when only a few predictors have substantial coefficients, while ridge tends to do better when the response depends on many predictors with coefficients of roughly equal size. In either case, we still need to choose the tuning parameter \(\lambda\).
The idea is to fit the ridge or lasso model over a grid of possible values of \(\lambda\), compute the cross-validation error for each value of \(\lambda\), and select the value for which this error is smallest.
Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
Figure: Left: ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Slides 9 and 17. Right: the corresponding lasso coefficient estimates.
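A sketch of this cross-validation procedure with `cv.glmnet` (assuming a predictor matrix `X` and response `y`; `alpha = 1` gives the lasso):

```r
library(glmnet)

# 10-fold cross-validation over an automatically chosen grid of lambda values.
cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 1, nfolds = 10)

# Value of lambda with the smallest cross-validation MSE, and the largest
# lambda whose CV error is within one standard error of that minimum.
cv_fit$lambda.min
cv_fit$lambda.1se

# Coefficients of the model fit on all observations, evaluated at the
# selected value of the tuning parameter.
coef(cv_fit, s = "lambda.min")
```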
Both Ridge and Lasso regression can be seen as particular cases of Elastic Net regression, which minimizes
\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \{ (1 - \alpha) \beta_j^2 + \alpha |\beta_j|\}\]
The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge.
Lasso: glmnet(X, y, family = "gaussian", alpha = 1)
Ridge: glmnet(X, y, family = "gaussian", alpha = 0)
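An elastic net fit simply uses an intermediate value of \(\alpha\); for example (a sketch, with `X`, `y`, and the choice \(\alpha = 0.5\) assumed):

```r
library(glmnet)

# Elastic net: alpha between 0 (ridge) and 1 (lasso). alpha itself can be
# treated as a second tuning parameter and chosen by cross-validation over
# a small grid, e.g. alpha in 0, 0.25, 0.5, 0.75, 1.
enet_fit <- glmnet(X, y, family = "gaussian", alpha = 0.5)

# Cross-validate lambda for this choice of alpha.
cv_enet <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)
coef(cv_enet, s = "lambda.1se")
```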